Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity,
however, has received relatively less attention, especially in the setting where
the sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where the matrices of pairwise and partial correlations
among the variables are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
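
The sample-starved regime described above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes mutually independent standard Gaussian variables, fixes the sample size at n = 10, and reports the largest off-diagonal sample correlation as the dimension p grows. The function name `max_spurious_correlation` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def max_spurious_correlation(n, p, seed=0):
    """Largest |sample correlation| among p mutually independent variables.

    n is the number of samples (rows), p the number of variables (columns).
    Because the variables are independent, every nonzero off-diagonal
    correlation is purely a finite-sample artifact.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # n samples of p independent variables
    R = np.corrcoef(X, rowvar=False)     # p x p sample correlation matrix
    np.fill_diagonal(R, 0.0)             # ignore the trivial unit diagonal
    return float(np.abs(R).max())


# Fixed sample size, growing dimension (the "purely high dimensional" regime).
results = {p: max_spurious_correlation(n=10, p=p) for p in (10, 100, 1000)}
for p, r in results.items():
    print(f"p = {p:5d}   max |correlation| = {r:.3f}")
```

With n held fixed, the largest spurious correlation tends to move toward 1 as p increases, which is one reason a fixed threshold on sample correlations becomes unreliable when the data are variable-rich but sample-starved.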
On Measure Transformed Canonical Correlation Analysis
In this paper linear canonical correlation analysis (LCCA) is generalized by
applying a structured transform to the joint probability distribution of the
considered pair of random vectors, i.e., a transformation of the joint
probability measure defined on their joint observation space. This framework,
called measure transformed canonical correlation analysis (MTCCA), applies LCCA
to the data after transformation of the joint probability measure. We show that
a judicious choice of the transform leads to a modified canonical correlation
analysis, which, in contrast to LCCA, is capable of detecting non-linear
relationships between the considered pair of random vectors. Unlike kernel
canonical correlation analysis, where the transformation is applied to the
random vectors, in MTCCA the transformation is applied to their joint
probability distribution. This results in performance advantages and reduced
implementation complexity. The proposed approach is illustrated for graphical
model selection in simulated data having non-linear dependencies, and for
measuring long-term associations between companies traded on the NASDAQ and
NYSE stock markets.
Robust Multiple Signal Classification via Probability Measure Transformation
In this paper, we introduce a new framework for robust multiple signal
classification (MUSIC). The proposed framework, called robust
measure-transformed (MT) MUSIC, is based on applying a transform to the
probability distribution of the received signals, i.e., transformation of the
probability measure defined on the observation space. In robust MT-MUSIC, the
sample covariance is replaced by the empirical MT-covariance. By judicious
choice of the transform we show that: 1) the resulting empirical MT-covariance
is B-robust, with bounded influence function that takes negligible values for
large norm outliers, and 2) under the assumption of spherically contoured noise
distribution, the noise subspace can be determined from the eigendecomposition
of the MT-covariance. Furthermore, we derive a new robust measure-transformed
minimum description length (MDL) criterion for estimating the number of
signals, and extend the MT-MUSIC framework to the case of coherent signals. The
proposed approach is illustrated in simulation examples that show its
advantages as compared to other robust MUSIC and MDL generalizations.
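
The idea of replacing the sample covariance with a reweighted one can be sketched as follows. This is only an illustration in the spirit of the abstract, not the authors' estimator: the abstract does not specify the transform, so the Gaussian-shaped weighting function u(x) = exp(-||x||^2 / omega^2), the width `omega`, the function name `weighted_covariance`, and the contaminated test data are all assumptions.

```python
import numpy as np


def weighted_covariance(X, omega):
    """Weighted covariance that downweights large-norm samples.

    X     : (n, d) array, one real or complex sample per row.
    omega : width of the assumed Gaussian weight u(x) = exp(-||x||^2 / omega^2).
    """
    norms_sq = np.sum(np.abs(X) ** 2, axis=1)
    u = np.exp(-norms_sq / omega**2)     # near zero for large-norm outliers
    w = u / u.sum()                      # normalized weights, sum to 1
    mu = w @ X                           # weighted mean
    Xc = X - mu
    return (Xc * w[:, None]).T @ Xc.conj()


# Hypothetical data: 500 clean 2-D Gaussian samples plus 5 gross outliers.
rng = np.random.default_rng(1)
clean = rng.standard_normal((500, 2))
outliers = np.full((5, 2), 100.0)
X_contaminated = np.vstack([clean, outliers])

sample_cov = np.cov(X_contaminated, rowvar=False)
robust_cov = weighted_covariance(X_contaminated, omega=5.0)

print("trace of sample covariance  :", np.trace(sample_cov))
print("trace of weighted covariance:", np.trace(robust_cov))
```

In this sketch the outliers receive weights that are numerically negligible, so the weighted estimate stays close to the covariance of the clean samples, while the ordinary sample covariance is dominated by the five outliers. Note that the weighting also shrinks the estimate somewhat for clean Gaussian data, so the width `omega` trades robustness against this shrinkage.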